Audio Segmentation Outlier Detection with Clustering

code
outlier
clustering
Author

Daniel Hassler

Published

November 17, 2023

Author: Daniel Hassler

```{python}
import librosa
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import pandas as pd
import soundfile as sf
import matplotlib.pyplot as plt
import seaborn as sns
import IPython
```

Background

Outlier detection is a common use-case for clustering algorithms, especially in the realm of fraud detection, spam mail, etc. Instead of fraud detection approaches, I thought audio segmentation would be another interesting application for outlier detection with clustering. This ML approach is a challenging task given the nature of how sound is used in a computational setting, which requires some DSP (digital signal processing) background. Disclaimer: I am new to DSP.

Audio Segmentation Importance

As I was researching this problem, I found that audio outlier detection is an interesting problem that can be applied to a lot of domains. In the realm of public safety, we can use audio segmentation to separate sounds like normal city sounds, bird chirps, as well as outlier events like gunshots. This can be applied to a full-stack system where authorities can be alerted about those outlier events autonomously.

Another cool use-case for audio segmentation is music composition. We can take any song from the internet and separate the voices based on instrument; for example, guitar, bass, drums, and vocals. Clustering algorithms can be applied to this space to separate the voices, and with a full-stack system, can be applied to music editing, genre classification, or even music generation.

My Application

For my application, I am applying audio segmentation to trail camera data. For my example, I have found an under minute long sound file containing normal forest noises: rustling branches, wind, bird chirps, that also includes an outlier (uncommon) event of a screaming mountain lion. The assumption here is that in most geographical settings, we don’t experience mountain lion screaming everyday (it’s an outlier event), and the common (non-outlier) event would be the forest sounds.

```{python}
IPython.display.Audio(r"./mountain_lion_scream.wav")
```
```{python}
audio_file = "mountain_lion_scream.wav"
audio_data, sr = librosa.load(audio_file)

time = librosa.times_like(audio_data, sr=sr)

plt.figure(figsize=(10, 5))
librosa.display.waveshow(audio_data, sr=sr, alpha=0.8)
plt.title('Waveform of Forest Noise Data')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
plt.show()
```

Above is a plot showing the signal in the time domain.

DSP Background for Problem

With the overall complicated nature of sound data, it is important to effectively process it beforehand so K-means can apply the data effectively. In order to do audio segmentation, the original audio data needs to run through some transformations first. After reading some sources online, the most widely used transformation for audio data in this space is MFCC (Mel-Frequency Cepstral Coefficients). The key takeaway with MFCCs is to capture discretized, compact, perceptual relevant information of a signal.

To compute the cepstrum, all we need is a time domain signal x(t), which is essentially just the audio data. This goes through mutliple parts: computing the DFT (discrete fourier transform) F on the signal, applying the log on that output for perceptual relevancy of loudness, and then taking the DFT inverse of it. Essentially, this is a spectrum of a spectrum, but it’s coined as a “cepstrum”: \[ C(x(t)) = F^{-1}[log(F[x(t)])] \]

To compute the MFCC, we take this idea of computing the cepstrum, but instead, change a couple parts of this formula. The first few steps are the same: taking the time domain signal, computing the DFT and applying the log, but within this step, we apply mel-scaling. Mel-scaling is a perceptual scale of pitches that approximates the human ear’s response to different frequencies and is usually in a triangular shape. After this step, instead of applying the DFT inverse on the spectrum, we apply Discrete Cosine Transform (DCT). DCT is a simplified version of Fourier Transform and is essentially a dimensionality reduction layer.

When it comes to the coefficients, each coefficient describes spectral characteristics of the audio signal and captures different aspects of the signal’s spectrum. The first few coefficients represent the overall energy of the signal, whereas the higher order coefficients capture more finer details of the spectrum. In my code block below, I am taking the first 13 coefficients, as this was in the standard range of normal use cases.

```{python}
# Extract features (MFCCs as an example)
mfccs = librosa.feature.mfcc(y=audio_data, sr=sr, n_mfcc=13)

# Transpose the feature matrix to have time on the x-axis
mfccs_transposed = np.transpose(mfccs)

print(audio_data.shape) # 1D array with 1138688 components.
print(audio_data.shape[0] / sr) # 51 second audio clip.
print(mfccs.shape) # 2225 frames, each with 13 features.
```
(1138688,)
51.641179138321995
(13, 2225)
```{python}
plt.figure(figsize=(10, 4))
librosa.display.specshow(mfccs, x_axis='time', sr=sr, cmap='viridis')
plt.colorbar(format='%+2.0f dB')
plt.title('MFCCs')
plt.xlabel('Time (s)')
plt.ylabel('MFCC Coefficients')
plt.show()
```

Importance with Clustering (K-means)

Applying this MFCC preprocessing step on the audio file is beneficial for audio segmentation clustering. There are a few advantages of applying MFCC to the data including dimensionality reduction and effective feature extraction (similar to the idea of PCA, but with audio signal data), and its frequency representation (instead of time representation), which makes it easier to identify different sounds or components within the signal.

K-means Clustering

For the purposes of treating this application as an outlier detection problem, I’ve set the number of clusters to be 2, sort of like a traditional binary classification problem (i.e. is the data fraud or not fraud). In my case, I’m setting up K-means for the idea “is the sound normal or not normal” for a wildlife observation setting.

The cool thing about K-means for this problem is that I can specify how many distinct sounds I want to pick up from based on the number of clusters. If I choose n_clusters to be 3 or 4, I am telling K-means to find 3 or 4 separate distinct signals.

```{python}
kmeans = KMeans(n_clusters=2, random_state=42)
cluster_labels = kmeans.fit_predict(mfccs_transposed)

labels = kmeans.labels_

pca = PCA(n_components=2).fit_transform(mfccs_transposed)
df_pca = pd.DataFrame(pca,columns=['pca1','pca2'])

sns.scatterplot(x="pca1", y="pca2", hue=kmeans.fit_predict(mfccs_transposed), data=df_pca)
plt.title('K-means Clustering PCA on MFCCS Data')
plt.show()
```
D:\Users\dwh71\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\cluster\_kmeans.py:1416: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Users\dwh71\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\cluster\_kmeans.py:1416: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)

In the above plot, I am reducing the dimensionality of the data to two components with PCA for visualization purposes.

Evaluation

```{python}
# Visualize the clusters
plt.figure(figsize=(10, 4))
librosa.display.specshow(mfccs, x_axis='time', sr=sr, hop_length=512, cmap='viridis')
plt.scatter(np.arange(len(cluster_labels)), cluster_labels, color='red', marker='x', s=30)
plt.title('MFCCs with Clustering Labels')
plt.colorbar(format='%+2.0f dB')
plt.show()
```

The above plot tells me critical details about how my K-means label assignments are chosen based on the MFCC data. As you can see, there are two clear distinct audio signals in the original audio sample that K-means tried to pick up on. This is because our K-means cluster is set up with two clusters. If I chose more clusters, there would be more horizontal lines in the data showing different separated audio segments.

Now to test the K-means cluster assignments, I store the unique voices as their cluster assignments, save them to a .wav file, and then hear if the two audios were distinctly separated.

```{python}
# Separate voices based on cluster assignments
voices = [mfccs_transposed[cluster_labels == label] for label in np.unique(cluster_labels)]

# # Save the separated voices to audio files
for i, voice_mfcc in enumerate(voices):
    voice = librosa.feature.inverse.mfcc_to_audio(voice_mfcc.T)
    sf.write(f'separated_voice_{i}.wav', voice, sr)
```
```{python}
IPython.display.Audio(r"./separated_voice_0.wav") # mostly screaming mountain lion
```
```{python}
IPython.display.Audio(r"./separated_voice_1.wav") # mostly forest sounds
```

Improvements

Obviously the sound files are not perfect, but they do show meaningful representations of segmented audio on a real world complex application. One improvement I would consider would be changing up the pre-processing phase, like tuning the parameters of the MFCC functions or how the audio is handled.

Sources

  • MFCC: https://www.youtube.com/watch?v=4_SH2nfbQZ8
  • K-means, MFCC: https://medium.com/@evertongomede/audio-segmentation-and-artificial-intelligence-a-harmonious-symphony-f472dd770b97